ABSTRACT

Function  approximation  is  necessary  when  applying  RL  to  either  Markov  decision 
processes  (MDPs)  or  semi-Markov  decision  processes  (SMDPs)  with  very  large  state 
spaces. An often overlooked issue in approximating Q-functions in either framework
arises when an action-value update in a given state causes a large policy change in other
states. Another way of stating this is to say that a small change in the Q-function results in
a large change in the implied greedy policy. We call this sensitivity to changes in the
Q-function the dynamic range problem and suggest that it may result in greatly increasing
the  number  of  training  updates  required  to  accurately  approximate  the  optimal  policy.  We 
demonstrate  that  Advantage  Learning  solves  the  dynamic  range  problem  in  both 
frameworks,  and  is  more  robust  than  some  other  RL  algorithms  on  these  problems.  For  an 
MDP,  the  Advantage  Learning  algorithm  addresses  this  issue  by  re-scaling  the  dynamic 
range  of  action  values  within  each  state  by  a  constant.  For  SMDPs  the  scaling  constant 
can  vary  for  each  action. 
1.  INTRODUCTION

Reinforcement  learning  (RL)  systems  are  commonly  used  to  solve  Markov  Decision 
Processes  (MDPs),  tasks  in  which  the  interval  between  decisions  is  fixed.  RL  systems 
should  be  general  enough  to  solve  tasks  that  require  performing  actions  at  a  mixture  of 
time  scales  (i.e.,  the  RL  system  makes  decisions  concerning  low-level,  as  well  as  high- 
level,  actions).  Such  tasks  are  commonly  called  semi-Markov  decision  processes 
(SMDPs)  and  differ  from  MDPs  in  that  the  interval  between  decisions  varies.  People 
sometimes  use  hierarchies  for  solving  large  MDPs.  When  doing  so,  the  problem  is  by 
definition  an  SMDP. 

One  way  to  solve  MDPs  and  SMDPs  is  to  find  an  approximation  of  the  action-value 
function  for  the  task.  The  optimal  policy  can  easily  be  extracted  by  choosing  the  action 
with  the  greatest  approximated  value  in  each  state.  One  of  the  most  common  RL 
algorithms for finding action-value functions is Q-learning (Watkins, 1989). For large
MDPs and SMDPs it may be necessary to combine Q-learning with general function
approximation to find approximations of the action-value functions (Q-functions).

For  a  given  function  approximator  and  task  (MDP  or  SMDP),  it  may  be  the  case  that  an 
action-value  update  in  a  given  state  always  causes  a  large  policy  change  in  other  states.  In 
other words, a small change in the Q-function will result in a large change in the implied
greedy  policy.  We  call  this  the  dynamic  range  problem  and  suggest  that  it  may  result  in 
greatly  increasing  the  number  of  training  updates  required  to  accurately  approximate  the 
optimal  policy. 

We  demonstrate  that  Advantage  Learning  solves  the  dynamic  range  problem  in  both  MDP 
and SMDP frameworks, and is more robust than Q-learning on these problems.

Advantage Learning and its forerunner, Advantage Updating, a generalization of the
Q-learning algorithm, have been discussed previously in the context of continuous-time
MDPs (Puterman, 1994) in Baird (1993, 1994), Harmon, Baird, and Klopf (1995, 1996),
Baird, Harmon, and Klopf (1996), and Harmon and Baird (1996a, 1996b).¹ Here we
present  a  general  framework  for  discussing  issues  of  dynamic  range  in  action-value 
functions.  We  then  use  this  framework  for  analyzing  the  action-value  functions  learned  by 
the Q-learning algorithm for both MDPs and SMDPs, which we call Q-functions following
common practice. We describe features we consider desirable in action-value functions
and derive an operator that maps a given Q-function to our desired function, the
Advantage function. We provide theoretical justification and empirical evidence of the
desirability of approximating Advantage functions instead of Q-functions.

Section  2  provides  the  notation  used  throughout  this  paper  and  background  information 
on MDPs and RL. Section 3 discusses issues involved in accurately approximating
Q-functions. In Section 4 the concept of an Advantage function is derived and empirical
results  are  presented  for  both  MDPs  and  SMDPs  that  illustrate  the  properties  of  this 
function.  Section  5  presents  alternatives  to  using  Advantage  functions  and  includes 
closing  remarks. 

2.  BACKGROUND  AND  NOTATION 

RL  systems  typically  use  a  set  of  real-valued  parameters  to  store  the  information  that  is 
learned. When a parameter is updated during learning, the notation $w \leftarrow k$ represents the
operation of instantaneously changing the parameter $w$ so that its new value is $k$, whereas
$w \xleftarrow{\alpha} k$ represents the operation of moving the value of $w$ toward $k$. This is equivalent
to $w \leftarrow (1 - \alpha) w + \alpha k$, where the step size parameter $\alpha$ is a small positive number.
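
As a minimal sketch of the "move toward" update, the hypothetical helper below (its name and the example values are assumptions made for illustration) moves a parameter a fraction of the way toward a target value:

```python
def move_toward(w: float, k: float, alpha: float) -> float:
    """Return w moved a fraction alpha of the way toward the target k."""
    return (1.0 - alpha) * w + alpha * k


# Three updates toward k = 1 with step size 0.5 (illustrative values only).
w = 0.0
for _ in range(3):
    w = move_toward(w, k=1.0, alpha=0.5)
```

With a step size of 1 the helper reproduces the instantaneous assignment; smaller step sizes average the new target with the old value.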

The  functions  stored  in  a  learning  system  at  a  given  time  are  represented  by  variables 
without superscripts such as $\pi$, $V$, $A$, or $Q$. The optimal functions that are being
approximated are represented by * superscripts, such as $\pi^*$, $V^*$, $A^*$, or $Q^*$.


¹ A summary of these results is given in Appendix A.
2.1  Markov  Decision  Processes

A  Markov  decision  process  (MDP)  is  a  system  that  changes  its  state  based  upon  its 
current  state  and  an  action  chosen  by  a  controller.  The  set  of  possible  states  and  the  set  of 
possible actions may each be finite or infinite. At time $t$, the controller chooses an action,
$u_t$, based upon the state of the MDP, $x_t$. The MDP then transitions to a new state, $x_{t+1}$.
The state transition may be stochastic, but the probability, $P(x_t, u_t, x_{t+1})$, of transitioning
from state $x_t$ to state $x_{t+1}$ after receiving action $u_t$ is a function of only $x_t$, $x_{t+1}$, and $u_t$,
and is not affected by previous states or actions. The MDP also sends the controller a
scalar value known as a reward. The expected reward received by the controller as a
result of the transition from state $x_t$ to $x_{t+1}$ is $R(x_t, u_t)$.

A policy, $\pi$, is a function that specifies a particular action for the controller to perform in
each state $x$. The expected total discounted reward associated with a state $x$ is the
expected sum of rewards received when starting in that state:
$E\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x\right]$, where
$r_t$ is the reward received at time step $t$. The discount factor $\gamma$, $0 \le \gamma \le 1$, is a parameter that
determines the relative significance of earlier versus later reward. The value of a state $x$,
$V^{\pi}(x)$, is the expected total discounted reward received for starting in state $x$ and
choosing all actions according to a given policy $\pi$. The value of an action, $Q^{\pi}(x,u)$, is
the expected total discounted reward for starting in state $x$, performing action $u$, and then
choosing all actions according to the given policy $\pi$.

An optimal policy, $\pi^*$, for a given MDP is a policy such that choosing $u_t = \pi^*(x_t)$
results in maximizing the expected total discounted reward for any choice of starting state.
If there are a finite number of states and actions, and $\gamma < 1$, then at least one optimal
policy is guaranteed to exist.

2.2  State-Value  Functions  and  Action-Value  Functions

Roughly, the goal of RL is to find an optimal policy, $\pi^*$. Policies can be extracted from
either state-value functions (functions of state only) or action-value functions (functions of
both state and action). Consequently, RL algorithms can
generally be grouped into two categories: those that approximate state-value functions,
and those that approximate action-value functions.

The goal of the RL system is to find a function $V$ that satisfies the following equation for
all $x_t$:

$$V(x_t) = \max_u E\left[R(x_t, u) + \gamma V(x_{t+1})\right], \qquad (1)$$

where $E$ indicates the expected value of performing action $u$ in state $x_t$.

The unique solution to Equation 1 is the optimal state-value function $V^*$. Equation 1 is
called the Bellman equation in dynamic programming (Bertsekas, 1987). The optimal
policy, $\pi^*$, can be extracted from $V^*$ by letting
$\pi^*(x_t) = \arg\max_u E\left[R(x_t, u) + \gamma V^*(x_{t+1})\right]$.


The goal of an RL system using Q-learning is to find a function $Q$ that satisfies the
following equation for all $(x,u)$ pairs:

$$Q(x_t, u_t) = E\left[R(x_t, u_t) + \gamma \max_{u_{t+1}} Q(x_{t+1}, u_{t+1})\right]. \qquad (2)$$

The unique solution to Equation 2 is the optimal action-value function, $Q^*$. Equation 2 is
the Bellman equation for Q-learning. The optimal policy, $\pi^*$, can be extracted from $Q^*$
by letting $\pi^*(x_t) = \arg\max_u Q^*(x_t, u)$.
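
To make the Bellman equation for action values concrete, the following sketch repeatedly applies its right-hand side to a tabular action-value function and then extracts the greedy policy. The three-state chain, the function names, and the discount factor of 0.9 are assumptions chosen for illustration; this is not the Q-learning algorithm itself, which learns from sampled transitions rather than from a known model.

```python
def q_value_iteration(P, R, gamma, sweeps=200):
    """Repeatedly apply the Bellman backup for action values to a tabular Q.

    P[x][u] is a list of (probability, next_state) pairs and R[x][u] is the
    expected reward for taking action u in state x.
    """
    Q = [[0.0 for _ in actions] for actions in P]
    for _ in range(sweeps):
        # The whole right-hand side is built from the old Q before rebinding.
        Q = [[R[x][u] + gamma * sum(p * max(Q[y]) for p, y in P[x][u])
              for u in range(len(P[x]))]
             for x in range(len(P))]
    return Q


def greedy_policy(Q):
    """pi(x) = argmax_u Q(x, u); ties go to the lowest action index."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]


# Hypothetical chain: states 0, 1, 2 (state 2 absorbing);
# action 0 = backward, action 1 = forward; reward -1 per non-terminal step.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 2)]],
    [[(1.0, 2)], [(1.0, 2)]],
]
R = [[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]]
Q = q_value_iteration(P, R, gamma=0.9)
policy = greedy_policy(Q)
```

In this toy chain the greedy policy chooses the forward action in both non-terminal states, because the forward action has the larger action value there.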


3.  FOUNDATIONS:  POLICY  REPRESENTATION,  DYNAMIC  RANGE, 
AND  SENSITIVITY 

3.1.  The  Method  of  Extracting  $\pi$  Has  Significant  Consequences

The choice of which function to approximate, $V^*$ or $Q^*$, can have a profound effect on the
degree of accuracy needed in the approximation before the implied policy, $\pi$, equals the
optimal policy, $\pi^*$, in all states. To explain further and develop intuition, we present the
following deterministic MDP. The set of states is $X = \{0, 0.01, 0.02, \ldots, 1\}$. There is a single
absorbing state, x=1.

This MDP can be visualized as a hallway with a door at one end (x=1). Given any initial
state, the goal is to maximize the total undiscounted reward received before exiting the hallway.
Two actions are possible in every state: step forward ($x_{t+1} = x_t + 0.01$) or step backward
($x_{t+1} = x_t - 0.01$), with the exception of the state x=0, in which both actions result in x=0.01.
Each action results in a negative unit reward (-1). The optimal state-value function, $V^*$,
and optimal Q-function, $Q^*$, for this MDP are graphed below.
Assuming that the state-value function is approximated by an affine function of the state, a
very small degree of accuracy is needed in the state-value function approximation before
$\pi(x) = \pi^*(x)$ for all $x$. All that is required is for the slope of the approximation to have
the correct sign. In other words, if $V(x + 0.01) > V(x)$ for all $x$, then $\pi(x) = \pi^*(x)$ for all $x$.
Because we have assumed a function approximator that generalizes well, very few training
samples will be needed to achieve this result.

However, this is not the case for the Q-function. Here the policy is extracted by
comparing the relative action values within a given state. The action with the greatest
value is the policy's choice for that state. This necessarily means that the action the
approximation implies has the greatest value must correspond to the action with the
greatest value in the optimal Q-function for the same state, and this must be true for all
states. In the above example, $Q^*(x, \text{forward}) > Q^*(x, \text{backward})$ for all $x$ (with the
exception of x=0). Therefore, the approximation, $Q$, must accurately reflect this ordering
for all $x$. In short, in the state-value function it takes a large change in parameters to
change the slope enough to change the policy, but in the Q-function it takes only a tiny
change in the parameters to raise one line above the other, resulting in a change in policy.

Generalization  is  the  term  often  used  to  refer  to  the  change  in  the  value  of  one  state  as  a 
result of training in another state. In the above example, when approximating $V^*$,
generalization helps to quickly find an approximation that results in the correct policy in
every state. However, as we demonstrate below, when approximating $Q^*$, generalization
may be a hindrance. Moreover, we demonstrate that one can alleviate the problem with a
simple transformation of the Q-function that results in a new action-value function that is
far easier to approximate.

3.2.  Dynamic  Range  and  Sensitivity 

What properties of action-value functions make it difficult to achieve an implied policy of
$\pi^*$? Should we simply use a different function approximator? Or, for a given function
approximator, can we transform the action-value function in such a way that we can
achieve an implied policy of $\pi^*$ with fewer training examples?

To  answer  these  questions,  we  now  introduce  a  framework  for  discussing  several 
properties  of  action-value  functions. 

3.2.1.  Dynamic  Range 

An important consideration is the relationship between the dynamic range of action values
in a given state and the dynamic range of state values. We define two variables, $D_{sa}$ and
$D_{aa}$, where sa refers to state-action and aa refers to action-action, to represent the two
distinct dynamic range ratios. Assuming a finite state set, the dynamic range over state
values is defined as

$$d_{sv} = \max_x V(x) - \min_x V(x).$$

Similarly, the dynamic range over all action values within a given state is defined as

$$d_{av1}(x) = \max_u Q(x,u) - \min_u Q(x,u).$$

We use the subscript av1 for this quantity to distinguish it from a different dynamic
range of action values within a given state. Namely, we define $d_{av2}(x,u)$ as

$$d_{av2}(x,u) = \max_{u'} Q(x,u') - Q(x,u).$$

We define the state-action dynamic range ratio, $D_{sa}(x)$, to be $d_{sv} / d_{av1}(x)$, with $D_{sa}(x) = 1$
if $d_{av1}(x) = 0$. The action-action dynamic range ratio, $D_{aa}(x,u)$, is defined as
$d_{av1}(x) / d_{av2}(x,u)$, again with $D_{aa}(x,u) = 1$ if $d_{av2}(x,u) = 0$. Although presented here
for completeness, our discussion of action-value functions will not include issues of
action-action dynamic range ratio, $D_{aa}(x,u)$, until we address functions relevant to the
SMDP framework in Section 4.3.
Consider the action-value function presented below. The state-value dynamic range, $d_{sv}$,
is $V(1) - V(0) = 0 - (-1) = 1$. The action-value dynamic range over all actions in state 0.75,
$d_{av1}(0.75)$, is $\max_u Q(0.75,u) - \min_u Q(0.75,u) = -0.25 - (-0.75) = 0.5$. Therefore,
$D_{sa}(0.75) = 1/0.5 = 2$.

[Figure: example action-value function; horizontal axis: States, from 0 to 1.0]
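
The dynamic range quantities can be computed directly from a tabular action-value function. The sketch below is illustrative only: the function name and the table are assumptions, with the values in state 0.75 taken from the worked example and the values in the other two states invented so that the state values span the range from -1 to 0.

```python
def dynamic_ranges(Q):
    """Return (d_sv, d_av1, D_sa) for a table Q[state][action] -> value."""
    V = {x: max(q.values()) for x, q in Q.items()}           # state values
    d_sv = max(V.values()) - min(V.values())                 # range over states
    d_av1 = {x: max(q.values()) - min(q.values()) for x, q in Q.items()}
    # By convention the ratio is 1 wherever the action values all coincide.
    D_sa = {x: (d_sv / d if d != 0 else 1.0) for x, d in d_av1.items()}
    return d_sv, d_av1, D_sa


# Hypothetical table; only the entries for state 0.75 come from the text.
Q = {
    0.0: {"forward": -1.0, "backward": -1.5},
    0.75: {"forward": -0.25, "backward": -0.75},
    1.0: {"forward": 0.0, "backward": 0.0},
}
d_sv, d_av1, D_sa = dynamic_ranges(Q)
```

For state 0.75 this reproduces the ratio of 2 computed above, and the state with identical action values falls back to the conventional ratio of 1.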


3.2.2  Sensitivity 

Another  property  of  action-value  functions  to  consider  is  the  degree  to  which  the  implied 
policy  is  changed  as  a  result  of  a  training  update  of  the  function  approximator.  The  degree 
to  which  the  policy  changes  is  a  function  of  both  the  function  approximator  and  Q* .  Here, 
we  want  to  know  how  a  different  choice  of  Q*  would  affect  this  policy  sensitivity. 


Let $g(x)$ represent the difference between the value of the state, $V(x)$, and the maximum
action value over the sub-optimal actions:

$$g(x) = V(x) - \max_{u \,\mid\, u \in U,\; Q(x,u) < V(x)} Q(x,u).$$

Given $g(x)$ we can define $\Delta g(x)$, the fractional change in $g(x)$ resulting from a single
update to the parameter vector of the function approximator transforming $g(x)$ to $g'(x)$:

$$\Delta g(x) = \frac{g'(x) - g(x)}{g(x)}.$$


Define the error in the approximation of the action value to be
$E(x,u) = |Q^*(x,u) - Q(x,u)|$. The fractional change in the error as a result of a single
update to the parameter vector of the function approximator upon training on the
state-action pair $(x,u)$ is

$$\Delta E(x,u) = \frac{E'(x,u) - E(x,u)}{E(x,u)}.$$

For any triple $(x', x, u)$, $x, x' \in X$, $x \ne x'$, $u \in U$, we define sensitivity, $S(x', x, u)$, to be
the amount that $g$ changes in state $x'$ as a result of the change in the error in the
approximation of an action value in a different state $x$:

$$S(x', x, u) = \frac{\Delta g(x')}{\Delta E(x,u)}.$$

If sensitivity is small, then the change in the error in the approximation of the value for
state-action pair $(x,u)$ will have little effect on $g$ in state $x'$. If sensitivity is high, then a
small change in the approximation of the value for state-action pair $(x,u)$ will have a large
effect on $g$ in state $x'$. If $S(x', x, u) > 1$, then training enough in $(x,u)$ to eliminate the error
there changes the policy in state $x'$. If $S(x', x, u) < 1$, then training enough in $(x,u)$ to
eliminate the error will not change the policy in state $x'$.
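
The following sketch computes a sensitivity value from two snapshots of a tabular action-value function, taken before and after one training update. It assumes that sensitivity is the ratio of the fractional change in g at the other state to the fractional change in the error at the trained pair; the function names and all numbers are hypothetical and chosen only to illustrate the bookkeeping.

```python
def gap(q):
    """g(x): best action value minus the best strictly smaller action value."""
    v = max(q)
    suboptimal = [a for a in q if a < v]
    if not suboptimal:
        raise ValueError("all actions tie, so g(x) is undefined")
    return v - max(suboptimal)


def sensitivity(Q_star, Q_before, Q_after, x_prime, x, u):
    """Assumed form: fractional change in g(x') over fractional change in E(x, u)."""
    g0, g1 = gap(Q_before[x_prime]), gap(Q_after[x_prime])
    e0 = abs(Q_star[x][u] - Q_before[x][u])
    e1 = abs(Q_star[x][u] - Q_after[x][u])
    delta_g = (g1 - g0) / g0
    delta_e = (e1 - e0) / e0
    return delta_g / delta_e


# Hypothetical snapshots: training on (x=0, u=0) halves the error there, and
# generalization also shifts one action value in state 1.
Q_star = {0: [-1.0, -2.0], 1: [-0.5, -0.6]}
Q_before = {0: [-1.4, -2.0], 1: [-0.5, -0.6]}
Q_after = {0: [-1.2, -2.0], 1: [-0.5, -0.575]}
S = sensitivity(Q_star, Q_before, Q_after, x_prime=1, x=0, u=0)
```

Here the error at the trained pair shrinks by half while the gap in the other state shrinks by a quarter, so the sensitivity is about 0.5 and eliminating the error entirely would not be expected to flip the greedy action in that state.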


3.3.  Are  Sensitivity  and  Dynamic  Range  Related? 

Consider an MDP with two actions available in each state, $u_1$ and $u_2$. The optimal action
values $Q^*(x,u_1)$ and $Q^*(x,u_2)$ represent the long-term reward expected when starting in
state $x$ and performing action $u_1$ or $u_2$, respectively, for a single step, followed by optimal
actions thereafter. In a typical RL problem with a large (or continuous) state space, it is
frequently the case that performing one wrong action in a long sequence of optimal
actions has little effect on the total reward. In such a case, $Q^*(x,u_1)$ and $Q^*(x,u_2)$ are
relatively close (i.e., a small dynamic range $d_{av1}(x)$). On the other hand, the values of
widely separated states typically will not be close to each other. Therefore $\max_u Q^*(x_1,u)$
and $\max_u Q^*(x_2,u)$ may differ greatly for some choices of $x_1$ and $x_2$, resulting in a large
$d_{sv}$, which in turn results in a large $D_{sa}$.
The policy implied by a Q-function for a given state is determined by the relative action
values within that state. Consider the case where updating $Q(x,u)$ causes a change in
action values in $x'$, and updating $Q(x'',u)$ causes a change in action values in $x$ as we
approach the desired function $Q^*$. Our concern is only with what happens as the
approximation $Q$ approaches the desired function $Q^*$. Early in the learning process it is
reasonable to assume large policy changes. If the Q-function is stored in a function
approximation system that generalizes, the sensitivity of the implied policy will grow as
$D_{sa}$ grows. In cases where the penalty is small for one wrong action in a sequence of many
actions, the dynamic range of action values within a given state is small, and the implied
policy will be sensitive to generalization.

The change in the difference between the value of the action considered optimal in $x'$ and
the maximum value over sub-optimal actions in $x'$ as a result of updating the value of
$Q(x,u)$ will likely be large, thus corrupting the learned policy in $x'$. The policy in state $x$
may accurately reflect the optimal policy after the update, $\pi(x) = \pi^*(x)$. However,
because of the large sensitivity, $S(x, x'', u)$, the policy in state $x$ will also likely be
corrupted as a result of updating the value for some state-action pair $Q(x'',u)$. Thus, the
number of training updates required to achieve an implied policy that is an adequate
approximation of $\pi^*$ may be very large. This problem is not a property of any particular
function approximation system; rather, it is inherent in the definition of Q-functions.

3.4  Experiment  Set  #1 

3.4.1.  The  Hall  Problem 

For the purpose of illustrating the effects of a large $D_{sa}$, we consider the following class of
finite, deterministic MDPs. Each MDP has $10(2^i)$ states, where $i$ indicates the $i$th MDP.
The set of states, $X_i$, for MDP $M_i$ is

$$X_i = \left\{ \frac{j}{5(2^i)} \right\}_{j=0}^{5(2^i)}.$$

The set of actions possible in each state for every MDP in the class is {backward, forward}.
The one-step dynamics for each MDP are defined by the following:

$$x_{t+1} = \begin{cases}
x_t - \frac{1}{5(2^i)} & \text{if } u = \text{backward and } x_t > 0 \\
\frac{1}{5(2^i)} & \text{if } u = \text{backward and } x_t = 0 \\
1 & \text{if } x_t = 1 \\
x_t + \frac{1}{5(2^i)} & \text{otherwise.}
\end{cases}$$

The reward function, $R_i$, is defined as $R_i(x,u) = -\frac{1}{5(2^i)}$ for all
$x \in X_i$, $u \in$ {backward, forward}. Finally, for each MDP, there exists a single terminal
state, x=1.


Each MDP can be visualized as a hallway with a door at one end (x=1). There are two
actions available to the RL system in each state: step forward, and step backward (with
the exception of the initial state, in which case both actions result in a step forward). The
objective is to successfully exit the hallway through the door.

Each MDP differs only in the size of the “step” in the hallway (i.e., the transition distance
from $x$ to $x'$). For example, MDP $M_0$ has a total of five non-terminal states: $X_0$ = {0, 0.2,
0.4, 0.6, 0.8, 1}, with a transition distance $\Delta x$ of 0.2. MDP $M_1$ has a total of 10
non-terminal states: $X_1$ = {0, 0.1, 0.2, ..., 0.8, 0.9, 1}, with a transition distance of 0.1. The
reward for each transition is the negative of the transition distance. Therefore, for $M_1$,
$R_1(x,u) = -0.1$.


Each graph below depicts the optimal action-value function, $Q^*$, for selected MDPs from
this class, given that rewards are discounted by $\gamma^{\Delta x}$. In each experiment the RL system is
initially placed at the end of the hallway opposite the door (x=0). Note that for all MDPs
the Q-values in state 0 are identical because both actions cause a transition to state $0 + \Delta x$.
Figure 1: MDP $M_0$, $D_{sa} \approx 2$. The short dashed line represents $d_{av}$, the difference between the value of
moving forward and backward. The long dashed line represents the difference in the maximum and
minimum state values over the entire domain.

Figure 2: MDP $M_1$, $D_{sa} \approx 4$. There are twice as many states in $M_1$ as in $M_0$. The value of $d_{sv}$
remained unchanged while the value of $d_{av}$ was reduced by approximately one half in all states other than
x=0. Notice this pattern holds true in Figures 3 and 4 below.

Figure 3: MDP $M_2$, $D_{sa} \approx 8$.

Figure 4: MDP $M_3$, $D_{sa} \approx 16$.

Figure 5: MDP $M_4$.

Figures 1-5 clearly demonstrate that as the length of the trajectory through state space
(measured in state transitions) increases, the significance, with respect to total reward
received, of performing a single sub-optimal action decreases. This results in an increase
in $D_{sa}$ (i.e., a decrease in $d_{av1}$ for each state relative to the dynamic range for state values).
As stated earlier, the policy for a given state is extracted from the approximation by
taking the argmax over action values in that state. Therefore, the action values in each
state must be approximated with a degree of accuracy that ensures correct relative values.
As $D_{sa}$ increases, the degree of accuracy required in the approximation increases.
Equivalently, as $D_{sa}$ increases, the degree of sensitivity, $S$, in the implied policy increases.
When using a function approximator that generalizes, it may become very difficult or
impossible to achieve an adequate approximation of $\pi^*$.
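
The growth of the state-action dynamic range ratio across the hallway MDPs can be reproduced numerically. The sketch below is a hedged illustration rather than the authors' code: it assumes a per-transition discount equal to a base discount of 0.9 raised to the power of the transition distance, treats both actions in the initial state as a forward step, solves for the optimal state values by synchronous value iteration, and reports the ratio at the state x = 0.4, which exists in every MDP of the class. The function and variable names are hypothetical.

```python
def hall_d_sa(i, gamma=0.9, x=0.4):
    """Return D_sa(x) for hallway MDP M_i under the assumptions stated above."""
    n = 5 * 2 ** i                      # number of non-terminal states
    dx = 1.0 / n                        # transition distance; reward is -dx per step
    disc = gamma ** dx                  # ASSUMPTION: discount applied per transition

    def next_index(j, u):
        # Both actions step forward from the initial state x = 0.
        return j + 1 if (u == "forward" or j == 0) else j - 1

    def q(V, j, u):
        return -dx + disc * V[next_index(j, u)]

    V = [0.0] * (n + 1)                 # V[n] stays 0 at the terminal state x = 1
    while True:
        new_V = [max(q(V, j, "forward"), q(V, j, "backward"))
                 for j in range(n)] + [0.0]
        converged = max(abs(a - b) for a, b in zip(new_V, V)) < 1e-12
        V = new_V
        if converged:
            break

    d_sv = max(V) - min(V)              # dynamic range over state values
    j = round(x * n)                    # index of the state at position x
    d_av1 = abs(q(V, j, "forward") - q(V, j, "backward"))
    return d_sv / d_av1 if d_av1 > 0 else 1.0


d_sa_values = [hall_d_sa(i) for i in range(5)]   # M_0 through M_4
```

Under these assumptions the ratio roughly doubles each time the transition distance is halved, which matches the trend reported in the figure captions.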
3.4.2  Empirical  Results

We  performed  sets  of  supervised  learning  experiments  to  demonstrate  the  validity  of  this 
assertion.  We  chose  to  use  supervised  learning  rather  than  RL  to  clearly  demonstrate  that 
the  objective  action-value  functions  are  at  issue,  not  other  characteristics  of  the  MDP  such 
as the number of states. In our experiments the optimal action-value functions of MDPs
$M_0$, $M_1$, $M_2$, and $M_3$ were used as the objective functions. In each experiment, two
CMACs (Albus, 1981) were used to represent the action-value function (one for each
action). Each CMAC had 10 tilings. Each tiling had 11 tiles, each of which covered 10%
of state space (excluding boundaries). In other words, each tile had an effect on 10% of
the action values; i.e., generalization extended over a tenth of the domain for all MDPs.

By  a  trial  we  mean  the  process  of  performing  parameter  updates  until  the  stopping 
criterion  is  met.  The  stopping  criterion  is  to  achieve  an  approximation  of  the  action-value 
function  that  implies  the  optimal  policy  in  every  state.  Batch  weight  updates  were 
performed;  the  parameters  of  the  function  approximator  were  updated  only  after  the 
presentation  of  all  input-output  pairs.  The  performance  measure  was  the  number  of 
weight  updates  required  before  achieving  the  stopping  criterion. 

For each MDP, the learning rate was optimized. Specifically, for a given learning rate, 10
independent  trials  were  performed  (each  initialized  with  a  different  random  number  seed), 
and  the  average  number  of  updates  required  to  achieve  the  stopping  criterion  was 
determined.  The  learning  rates  were  then  optimized  to  2  significant  figures.  The  results 
are  given  in  Figure  6: 
Figure 6: Time To Learn Policy (horizontal axis: $1/\Delta x$)

The  number  of  updates  required  increases  with  a  decrease  in  transition  distance  (i.e.,  an 
increase  in  Dsa).  Figure  6  shows  that  the  rate  of  growth  in  the  number  of  weight  updates 
(as  a  function  of  transition  distance)  required  to  achieve  the  stopping  criterion  is  quite 
large. 

4.  ADVANTAGE  FUNCTIONS 

As  stated  in  the  previous  section,  the  greedy  action  implied  by  an  action-value  function  in 
a  given  state  is  determined  by  the  relative  action  values  within  that  state.  If  the  function  is 
represented  by  an  approximation  system  that  generalizes  across  states,  the  implied  policy 
may  be  sensitive  to  the  generalization.  The  degree  of  sensitivity  scales  with  Dsa.  In  cases 
where  the  penalty  for  one  wrong  action  in  a  sequence  of  many  actions  is  small,  the 
dynamic range of action values within a given state, $d_{av1}$, decreases, and the implied policy
becomes  even  more  sensitive  to  generalization. 

One solution to the state-action dynamic range problem would simply be to exaggerate the
differences in action values within a state (i.e., re-scale the action-value dynamic ranges,
$d_{av1}$ and $d_{av2}$), while maintaining the state values for all states (i.e., the state-value dynamic
range, $d_{sv}$, remains unchanged), thereby decreasing $D_{sa}$ and decreasing the sensitivity of
the implied policy to generalization. Note that simply scaling the function by a constant
does re-scale $d_{av1}$, but it also re-scales $d_{sv}$ by the same amount and therefore does not
achieve the goal of decreasing $D_{sa}$.
4.1  Derivation  Of  Advantage  Function

We begin by defining an operator, $F$, on the space of action-value functions that achieves
our desired goal:

$F: \Theta \to \Theta$, where $\Theta = \{f \mid f: X \times U \to \Re\}$.

Given a function $c \in \Theta$ such that $c(x,u) \ge 1$ for all $x \in X$ and all $u \in U$, our objective is to
find an $F$ such that for all $x \in X$, $u, u' \in U$, and $Q \in \Theta$, two properties are satisfied.
Letting $A = F(Q)$, they are:

(i) $\max_u A(x,u) = \max_u Q(x,u)$,

(ii) $A(x,u') - A(x,u) = c(x,u')\left[Q(x,u') - Q(x,u)\right]$.

Property (i) ensures that the state-value dynamic range, $d_{sv}$, does not change as a result of
the transformation. Property (ii) ensures that the action-value dynamic range is
increased by a factor of $c(x,u)$, thereby decreasing $D_{sa}$.

Is there an $F$ that satisfies (i) and (ii) and, if so, is it unique? We will show that such an $F$
does exist and is unique and is given by:

$$A(x,u) = \max_{u'} Q(x,u') - c(x,u)\left[\max_{u'} Q(x,u') - Q(x,u)\right], \qquad (3)$$

where $u$ and $u'$ are arbitrary actions in state $x$.² The derivation of Equation 3 follows:


Rewriting (ii), we see that $A(x,u) = c(x,u')Q(x,u) + A(x,u') - c(x,u')Q(x,u')$.
Substituting for $A(x,u)$ in (i) produces

$$\max_u \left[c(x,u')Q(x,u) + A(x,u') - c(x,u')Q(x,u')\right] = \max_u Q(x,u) \quad \text{for any } u',$$

$$A(x,u') + c(x,u')\left[\max_u Q(x,u) - Q(x,u')\right] = \max_u Q(x,u),$$

$$A(x,u') = \max_u Q(x,u) - c(x,u')\left[\max_u Q(x,u) - Q(x,u')\right],$$

which is the same as

$$A(x,u) = \max_{u'} Q(x,u') - c(x,u)\left[\max_{u'} Q(x,u') - Q(x,u)\right].$$

² On first observation it might appear appropriate to rewrite Equation 3 as a weighted average, weighted by $c$ and
by $(1 - c)$. However, we choose $c \ge 1$ for all state-action pairs.


Thus, for a given function $c$ such that $c(x,u) \ge 1$ for all $(x,u)$ pairs, the derivation shows that
$F$ exists and is the unique operator that has properties (i) and (ii). We call $A = F(Q)$ the
Advantage function (Baird, 1993) derived from $Q$. Note that the Q-function is a special
case of an Advantage function: if $c(x,u) = 1$ for all $(x,u)$, Equation 3 reduces to the original
Q-function.
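
The transformation can be sketched for a single state as follows. The function name and the example action values are assumptions for illustration; the scale factor is held constant across actions here, which is the case in which the pairwise scaling property holds for every pair of actions.

```python
def advantage(q, c):
    """Apply the Advantage transformation to the action values of one state.

    q is a list of action values and c a list of scale factors (each >= 1),
    one per action.
    """
    v = max(q)                                   # state value, max_u Q(x, u)
    return [v - c_u * (v - q_u) for q_u, c_u in zip(q, c)]


# Hypothetical action values for one state, with a constant scale factor of 4.
q = [-0.25, -0.75, -0.5]
a = advantage(q, [4.0, 4.0, 4.0])
unchanged = advantage(q, [1.0, 1.0, 1.0])        # c = 1 returns the Q-values
```

The maximum action value is preserved, so the greedy action and the state value are unchanged, while every difference between action values is stretched by the scale factor.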


Figure 7 graphically demonstrates the results of transforming the Q-function into the
corresponding Advantage function when $c$ is a constant.

Figure 7: The action values in state $x$ before and after transforming the Q-function into an Advantage
function. (Horizontal axis: actions; vertical axis: values in state $x$.)

For a single state, x, both the original action-value function and the resulting Advantage
function are plotted in Figure 7. The dashed line indicates the maximum action value in
state x, which is by definition the value of the state, V(x). As is required by property (i),
max_u A(x,u) = max_u Q(x,u), which ensures that the policy remains unchanged as a result
of the transformation. Nor has d_sv been affected by the transformation. Observe that
the differences in action values in the original function (the Q-function) are small. However,
after the transformation, the differences in action values in the new function (the Advantage
function) have been greatly exaggerated. The dynamic range of action values within a
given state, d_av, is controlled by c. In other words, d_av has been scaled by a factor of c and
d_sv has not been changed, so D_sa has been reduced by a factor of c, resulting in a
diminished degree of sensitivity.
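The scaling of the within-state differences can be checked directly from Equation 3. As a short worked step, assuming c is the same constant for every action in state x (the dummy action v is introduced here only for the maximization):

```latex
\begin{align*}
A(x,u) - A(x,u')
  &= -c\left[\max_{v} Q(x,v) - Q(x,u)\right] + c\left[\max_{v} Q(x,v) - Q(x,u')\right] \\
  &= c\left[Q(x,u) - Q(x,u')\right]
\end{align*}
```

Every difference between two action values in the same state is therefore multiplied by c, while the maximum over actions is left where it was.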


Choosing  c 


Ideally, one would like to choose the function c such that D_sa = 1 for all states. A function
that satisfies this criterion is called c*. It will rarely be the case that we have enough
information a priori to determine c*. However, it is not difficult to develop heuristic
approximations of c* that result in "good" Advantage functions. For example, if the MDP
is an approximation of an underlying continuous-time system, then a good approximation
of c* might be 1/(KΔt), where Δt is the time step duration for the chosen action and K is an
arbitrary constant. As Δt is halved, the change in the total discounted reward received as a
result of performing a single sub-optimal action is halved. In other words, d_av is halved
and D_sa is doubled, resulting in increased sensitivity. Choosing c(x,u) = 1/(KΔt) counteracts
the increase in D_sa and causes the underlying Advantage function to remain independent of
Δt for small Δt.
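The effect of the 1/(KΔt) heuristic can be illustrated numerically. The sketch below assumes, as the paragraph above states, that the Q-value gap caused by one sub-optimal action shrinks in proportion to Δt. The constants K and REGRET_RATE and the function names are hypothetical values chosen for the illustration.

```python
K = 0.5            # arbitrary constant K (assumed value)
REGRET_RATE = 2.0  # assumed: Q gap of one sub-optimal action = REGRET_RATE * dt


def heuristic_c(dt, k=K):
    """Heuristic scaling c = 1 / (K * dt)."""
    return 1.0 / (k * dt)


def gaps_for(dt):
    """Return (Q gap, Advantage gap) for a time step of length dt.

    By Equation 3, max_u' A(x,u') - A(x,u) = c(x,u) * (max_u' Q(x,u') - Q(x,u)).
    """
    q_gap = REGRET_RATE * dt
    return q_gap, heuristic_c(dt) * q_gap


for dt in (0.4, 0.2, 0.1):
    q_gap, a_gap = gaps_for(dt)
    print(f"dt={dt}: Q gap={q_gap:.3f}, Advantage gap={a_gap:.3f}")
```

Under these assumptions the Q gap halves each time Δt is halved, while the Advantage gap stays fixed at REGRET_RATE / K.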


4.2 Example: Hall Problem

To illustrate that the policy implied by the Advantage function is less sensitive to
generalization than the policy implied by the original Q-function, we return to the Hall
Problem described in Section 3.4. Here we describe a set of experiments identical to those
presented in Section 3.4.2, with two exceptions: the function approximated was the
Advantage function, and no optimization was performed on the learning rate. The optimal
scaling function c was approximated by letting c(x,u) equal K/Δx for all (x,u) pairs, where
Δx is the transition distance and K = 0.2 is a scaling factor chosen to cause Q* and A* to be
the same function for MDP M_0. For example, c(x,u) = K/Δx = 0.2/0.2 = 1 for all (x,u) pairs
in M_0, which results in A(x,u) = Q(x,u) for all state-action pairs. The results are
summarized in Figure 8.


Figure 8: Comparison of time to learn the optimal policy for the Q-function and the Advantage function.
(Horizontal axis: 1/Δx.)

These results demonstrate that the number of epochs required to achieve the optimal policy
in every state is independent of the number of states in an optimal trajectory. It is always
the case that a function c exists that causes D_sa to remain unchanged as d_av changes. Using
the heuristic described above for choosing the function c resulted in a D_sa of approximately
2.2 for each of the Advantage functions. Therefore, we performed supervised learning on
essentially the same objective function for each MDP. This is demonstrated in Figures 9-12
below.

Figure 11: MDP M_2, D_sa = 2.40

Figure 12: MDP M_3, D_sa = 2A6 [sic]

4.3 Semi-Markov Decision Processes: Addressing D_aa

Section 3.2.1 defined variables to describe two distinct properties of action-value
functions. We illustrated the effects of the first, the state-action dynamic range (D_sa), on the
degree of sensitivity in the policy with respect to small changes in the parameter vector of
the function approximator. Here we address the second property, the action-action dynamic
range (D_aa), and discuss how it may exaggerate sensitivity.


4.3.1. Action-Action Dynamic Range and

In Section 3.2.1, D_aa(x,u) was defined as the ratio of two quantities: the gap between the
best action value and the value of action u, max_u' Q(x,u') - Q(x,u), and the full range of
action values in the state, max_u Q(x,u) - min_u Q(x,u). This ratio is demonstrated
graphically in Figure 13.

It may be that the degree of sensitivity in the implied policy in state x is directly related to
the magnitude of max_u D_aa(x,u). Why should this be the case?

Consider a variation of the Hall problem presented in Section 4.2. In this variation the hall
is replaced with a tightrope suspended between two platforms high above a concrete
floor. An RL acrobat, with perfect balance, is given the task of crossing from one
platform to the other in the fewest steps. The class of MDPs is parameterized
by the height of the platform from the floor. In each state the acrobat can perform one of
three possible actions: 1) step forward, 2) step backward, and 3) step to the side. Each
step moves 1 foot along the rope and incurs unit cost. A step to the side (a step off of the
rope) incurs a cost equal to the height of the platform from the floor; however, the acrobat
is secured by a tether that allows her to climb back to the location on the tightrope from
which she fell at no cost. All costs (negative rewards) are undiscounted. The length of
the tightrope is 50 feet, resulting in d_sv = 50. If we assume a platform height of 51 feet,
then D_sa = 1 for all x, the ideal state-action dynamic range. However, a new problem now
exists. A plot of the values for the three actions in state 25 (the middle of the rope) is
presented in Figure 14.

Figure 14: Action values for backward, forward, and side in state 25. D_sa = 1, D_aa = 25
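The tightrope numbers can be reproduced with a small sketch. The definitions of Section 3.2.1 are not restated here, so the code assumes D_sa = d_sv / (range of action costs in the state) and D_aa = (range of action costs) / (smallest non-zero gap to the best action). These assumed forms were chosen because they come out close to the values quoted in Figures 14 and 15. The function names are hypothetical.

```python
def tightrope_action_costs(state, rope_length, height):
    """Undiscounted cost of taking each action once, then acting optimally."""
    remaining = rope_length - state  # feet still to walk
    return {
        "forward": 1 + (remaining - 1),   # one step, then the rest of the rope
        "backward": 1 + (remaining + 1),  # one step back adds a foot to walk
        "side": height + remaining,       # fall cost, free climb back, then walk
    }


def dynamic_ranges(costs, d_sv):
    """Return (D_sa, D_aa) under the assumed definitions described above."""
    best = min(costs.values())  # costs are minimized
    gaps = [v - best for v in costs.values() if v > best]
    d_av = max(gaps)            # full range of action costs in the state
    return d_sv / d_av, d_av / min(gaps)


for height in (51, 102):
    costs = tightrope_action_costs(state=25, rope_length=50, height=height)
    d_sa, d_aa = dynamic_ranges(costs, d_sv=50)
    print(height, costs, round(d_sa, 2), d_aa)
```

Doubling the platform height leaves the forward/backward gap at 2 but doubles the range that the approximator must cover, which is the effect described in the text.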


To adequately approximate π*, the function approximator must accurately reflect the
ordering of the values for forward and backward. The acrobat will devote much of the
function approximator's resources to representing the knowledge that stepping to the side
will result in a very large cost, but this will come at the expense of accurately
discriminating between the values of stepping forward and stepping backward. The degree
of accuracy required for differentiating between these two actions increases with the height
of the platform.

This will necessarily result in an increase in the sensitivity in the implied policy as well.
Consider what happens if we double the height of the platform. Figure 15 demonstrates
that it becomes increasingly difficult to differentiate between the values of forward and
backward. Essentially, the acrobat will quickly learn not to take a step to the side, but it
will not easily be able to determine whether it should step forward or backward. As we
discuss in the next section, approximating Advantage functions rather than Q-functions
solves this problem as well, given an appropriate choice of c.

Figure 15: D_sa = 0.5, D_aa = 50


4.3.2.  Semi-Markov  Decision  Processes 

In previous sections we have assumed that actions are performed at each of a sequence of
unit time intervals. However, if using a hierarchical RL framework, actions at any of the
abstract levels in the hierarchy may be made at varying integral multiples of the unit time
interval. The interval between actions may be predetermined or random. Also, if a
continuous-time decision problem is treated as a discrete-time system where actions are
made upon change of state, actions may be made at varying time intervals. In these cases,
the framework is known as a semi-Markov decision process (SMDP). Such processes can
be treated the same as MDPs to a large extent by taking the reward on each discrete
transition as the integral of the reward over the corresponding continuous-time interval for
the continuous-time case, or the sum of the rewards over the duration of the
corresponding sequence of unit time intervals in the discrete-time case.

In the SMDP framework it is likely that D_aa is quite large. If rewards for actions are
a function of the duration of the transition, then it will be quite common for action values
within a given state to differ greatly, resulting in a large D_aa. In the special case that the
action set within a given state is a combination of "primitive" actions (actions that have a
duration of unit time) and "macro" actions (actions that have a duration of more than unit
time), D_aa will very likely be large. Only in the case that the macro actions have roughly
the same values as the primitive actions will D_aa not be an issue with regard to sensitivity.

Large action-action dynamic range ratios are not restricted to SMDPs. They may also
occur in MDPs, as demonstrated in Section 4.3.1.

4.3.3 Revised Hall Problem

To demonstrate the relationship between sensitivity and D_aa we again use a variation of the
Hall problem. The problem specification is changed by adding macro actions to the action
set in each state. The length of the hallway is fixed at one hundred primitive steps.
Specifically, X = {-49, -48, -47, ..., -1, 0, 1, ..., 48, 49, 50} and U_n = {mleft, left, right, mright}, where
mleft and mright are macro actions and n is the length of these actions measured in state
transitions. The class of MDPs, M, is parameterized by n. The state space "wraps around",
forming a cycle. For example, f(50, right) = -49. As in the original problem
specification, there exists a single terminal state, x = 0. The reward for each primitive
action is 0.01. The reward for each macro action is n(0.01), where n is the number of state
transitions.

In each of the examples presented below, we assume the rewards are not discounted and
are being minimized. We begin by comparing the Q-function and Advantage function for
M_1. In M_1 the primitive actions and macro actions are of the same duration, 1 state
transition.


Figure 16: The optimal Q-function for M_1; the length of the macro actions is 1 unit.
D_sa = 25, D_aa = 1. (Plotted series: Q(x,mleft), Q(x,left), Q(x,right), Q(x,mright).)

Figure 17: The optimal Advantage function for M_1 with c = 25 for all state-action pairs, including macros.
D_sa = 1, D_aa = 1. (Horizontal axis: state (x); plotted series: A(x,mleft), A(x,left), A(x,right), A(x,mright).)

The state-action dynamic range ratio, D_sa, is quite large in the optimal Q-function for
MDP M_1, shown in Figure 16. A choice of c(x,u) = 25 for all state-action pairs (a choice of
c*) results in an Advantage function with a D_sa ratio of 1. The action-action dynamic
range ratio, D_aa, for both functions is 1 because the duration of the macro actions equals
the duration of the primitive actions. However, it is important to consider how D_aa
changes as the duration of the macro actions increases. If we maintain a value of 25 for all
state-action pairs in c, then D_aa ≈ 2 in M_2, D_aa ≈ 4 in M_4, and D_aa ≈ 8 in M_8.

The optimal Q-function for MDP M_16 is presented in Figure 18 below. Each macro action
has a duration of 16 state transitions. The Q-function now has 2 large dynamic range
ratios, D_sa ≈ 25 and D_aa ≈ 16. These ratios are presented as approximations because the
actual values now vary as a function of state and action.
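The growth of D_aa with the macro length n can be checked with a short sketch of the revised Hall problem. Several details are assumptions made here rather than statements from the text: costs are counted in units of 0.01 (one primitive step), a macro action always moves the full n transitions and does not stop early if it passes over the terminal state, and D_aa is taken as the range of action costs in a state divided by the smallest non-zero gap to the best action. All names are hypothetical.

```python
import math

N_STATES = 100  # states -49 .. 50 arranged in a cycle
TERMINAL = 0


def move(x, k):
    """State reached from x after k signed transitions on the cycle."""
    return (x + k + 49) % N_STATES - 49


def optimal_costs(n):
    """Optimal action costs for M_n, in units of one primitive step (0.01)."""
    actions = {"mleft": -n, "left": -1, "right": 1, "mright": n}
    states = range(-49, 51)
    v = {x: (0 if x == TERMINAL else math.inf) for x in states}
    for _ in range(N_STATES):  # repeated sweeps of undiscounted value iteration
        for x in states:
            if x != TERMINAL:
                v[x] = min(abs(k) + v[move(x, k)] for k in actions.values())
    return {x: {a: abs(k) + v[move(x, k)] for a, k in actions.items()}
            for x in states if x != TERMINAL}


def d_aa(costs):
    """Assumed ratio; requires at least one sub-optimal action in the state."""
    best = min(costs.values())
    gaps = [c - best for c in costs.values() if c > best]
    return max(gaps) / min(gaps)


for n in (1, 2, 4, 8, 16):
    print(n, d_aa(optimal_costs(n)[1]))  # ratio in state x = 1
```

Under these assumptions the ratio in state x = 1 equals the macro length n, in line with the approximate values quoted above. Scaling every action by the same constant c leaves this ratio unchanged, because all gaps are multiplied by the same factor.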


Figure 18: The optimal Q-function for MDP M_16. D_aa ≈ 16. (Plotted series: Q(x,mleft), Q(x,left),
Q(x,right), Q(x,mright).)

The Advantage function for MDP M_16, choosing the optimal c for each state-action pair, is
presented in Figure 19. Note that using c* results in each sub-optimal action having unit
distance from the optimal action in each state.


Figure 19: The optimal Advantage function for MDP M_16 using c*. D_sa = 1, D_aa = 1. (Plotted series:
A(x,mleft), A(x,left), A(x,right), A(x,mright).)

However, as stated in Section 4.1, it is rarely the case that we have enough information to
determine the function c*. So it is reasonable to question the quality of the Advantage
function that results from using heuristics to choose the function c. In the Hall problem
presented in Section 4.2 we chose to use a constant value for the c function. The value
chosen for c was based on the change in state, Δx, resulting from performing a single
action. This accomplished the goal of re-scaling the state-action dynamic range ratio, D_sa,
and resulted in greatly decreasing the time required to achieve the stopping criterion in our
experiments. Can we expect a constant scaling function c to produce similar results for
SMDPs? As stated earlier, as the duration of the macro actions increases, the action-action
dynamic range ratio also increases. The Advantage function for MDP M_16 with a choice
of c(x,u) = 25 for all state-action pairs is presented in Figure 20 below.

Figure 20: The optimal Advantage function for MDP M_16 with a choice of c(x,u) = 25 for all state-action
pairs. D_aa ≈ 16

This choice of c results in an action-action dynamic range ratio, D_aa, as large as 16 in many
states. However, Figure 20 suggests a heuristic. Let the term regret be defined as the
change in the total reward received as a result of performing a single sub-optimal action.
For the problem at hand, we define a single unit of regret to be equal to the change in the
total reward received as a result of performing a single sub-optimal primitive action.

Using this definition it is clear that performing a single sub-optimal macro action can result
in as much as 16 units of regret (16 times more than a single primitive action). This
suggests scaling the value of c by a factor of 1/16 for all macro actions. More
generally, let the choice of c be a function of the duration of the action. For MDP M_1, the
durations of the macro actions and primitive actions are the same. Therefore we can use a
constant scaling factor of 25 for all state-action pairs. For MDP M_4 let
c(x,mleft) = c(x,left)/4 = 6.25, and in M_8 let c(x,mleft) = c(x,left)/8 = 3.125. Using this heuristic,
we construct the following function: c(x,u) = K/l(u), where K is the state-action dynamic
range ratio scaling factor and has a value of 25, and l(u) is the action-action dynamic range
ratio scaling factor and equals the duration of action u measured in state transitions. The
resulting Advantage function is shown in Figure 21.
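A minimal sketch of the duration-based heuristic c(x,u) = K/l(u) follows. The action values are hypothetical: they are the negated costs one would obtain in state x = 1 of M_16 if a macro action never stops early at the terminal state, which is an assumption made here. Negating the costs lets Equation 3 be applied in its maximizing form. The ratio used to compare the two choices of c (range divided by smallest non-zero gap) is likewise an assumed stand-in for D_aa.

```python
K = 25.0  # state-action scaling factor from the text


def heuristic_c(duration, k=K):
    """c(x,u) = K / l(u), with l(u) the duration of u in state transitions."""
    return k / duration


def advantage(q, c):
    """Equation 3 for one state; q and c map each action to Q(x,u) and c(x,u)."""
    best = max(q.values())
    return {u: best - c[u] * (best - q[u]) for u in q}


def range_over_smallest_gap(values):
    """Assumed stand-in for D_aa: full range divided by smallest non-zero gap."""
    best = max(values.values())
    gaps = [best - v for v in values.values() if v < best]
    return max(gaps) / min(gaps)


# Hypothetical values (negated costs) for one state, see the assumptions above.
q = {"left": -0.01, "right": -0.03, "mleft": -0.31, "mright": -0.33}
duration = {"left": 1, "right": 1, "mleft": 16, "mright": 16}

a_const = advantage(q, {u: K for u in q})
a_scaled = advantage(q, {u: heuristic_c(duration[u]) for u in q})
print(range_over_smallest_gap(a_const), range_over_smallest_gap(a_scaled))
```

With a constant c the macro actions remain about 16 times further from the best action than the sub-optimal primitive action. Dividing c by the duration brings all sub-optimal actions to roughly the same distance.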


Figure 21: The Advantage function resulting from a choice of c(x,u) = K/l(u), where K is the D_sa scaling
factor and l(u) is the D_aa scaling factor. (Horizontal axis: state (x); plotted series: A(x,mleft), A(x,left),
A(x,right), A(x,mright).)


4.3.4  Empirical  Results 

Again, we performed supervised learning experiments to demonstrate the desirability of
approximating Advantage functions over Q-functions. In our experiments the optimal
action-value functions of MDPs M_1, M_2, M_4, and M_16 were used as objective functions. In
each experiment a double-hidden-layer sigmoidal network was used to represent the
action-value function. Each hidden layer contained 2 bipolar sigmoids with a range of (-1,1).
The inputs of the network were the state, x, the action, u, and a bias. All parameters were
initialized to random values between -1 and 1.

Again, by a trial we mean the process of performing parameter updates until the stopping
criterion is met. The stopping criterion is to achieve an approximation of the action-value
function that implies the optimal policy in every state. Batch weight updates were
performed; the parameters of the function approximator were updated only after the
presentation of all input-output pairs. The performance measure was the number of
parameter updates required before achieving the stopping criterion.

After much effort and optimization, the stopping criterion was never achieved when
approximating the optimal Q-function for the simplest MDP in the class, M_1 (see Figure
16). However, the stopping criterion was achieved after 650 epochs when approximating
the optimal Advantage function for the same MDP (see Figure 17).

Averaged over multiple trials with different random number seeds, the system performed
an average of 650 epochs before achieving the stopping criterion for MDPs M_1, M_2, and
M_4. The system performed an average of 850 epochs before achieving the stopping
criterion for MDP M_16 (see Figure 21).

To demonstrate the effects of a large action-action dynamic range ratio, D_aa, in a second
set of experiments we chose to use a constant scaling function c with a value of 25 (c* for
MDP M_1) for MDP M_16. This choice of c resulted in an action-action dynamic range ratio
of 16 in many of the states. As expected, even after much optimization the system was
never able to achieve the stopping criterion.

5.  Conclusion 

We have presented one approach to reducing sensitivity resulting from a large state-action
dynamic range and/or a large action-action dynamic range. One might ask if it is possible
to address sensitivity issues by choosing an appropriate function approximator rather than
changing the objective function. The answer is yes. Specifically, one might eliminate all
sensitivity resulting from a large action-action dynamic range by using a separate function
approximator for each action. Likewise, one might eliminate all sensitivity resulting from
a large state-action dynamic range by using a separate function approximator for each
state. Of course, this solution relegates our choice of function approximator to a simple
look-up table.

It will certainly be possible in some cases to hand-craft a function approximator to work
with a given MDP, given enough a priori information about the optimal action-value
function. However, we propose that approximating Advantage functions is a much more
general and robust approach.


Appendix 

Advantage  Learning 

Given Equation 3 we can derive an RL algorithm that finds an approximation of the
Advantage function. We call this algorithm Advantage Learning (Harmon and Baird,
1996a). We begin by constructing a Bellman equation for Advantage Learning:

$$A(x,u) = \max_{u'} Q(x,u') - c(x,u)\left[\max_{u'} Q(x,u') - Q(x,u)\right]$$

$$A(x,u) = \max_{u'} Q(x,u') - c(x,u)\left[\max_{u'} Q(x,u') - E\left\{R(x,u) + \gamma \max_{u'} Q(x',u')\right\}\right]$$

$$A(x,u) = \max_{u'} A(x,u') - c(x,u)\left[\max_{u'} A(x,u') - E\left\{R(x,u) + \gamma \max_{u'} A(x',u')\right\}\right] \qquad (4)$$


where E indicates the expected value of performing action u in state x, and x' is the state
resulting from choosing action u in state x. A standard backup operation is given in
Equation 5.


$$A(x,u) \leftarrow \max_{u'} A(x,u') - c(x,u)\left[\max_{u'} A(x,u') - E\left\{R(x,u) + \gamma \max_{u'} A(x',u')\right\}\right] \qquad (5)$$
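As a sketch of what a single backup computes, the function below evaluates the right-hand side of the backup for one deterministic transition, so that the expectation reduces to the observed reward and next state. The three-state MDP, the table layout (a dictionary of dictionaries), and the constant c = 2 are hypothetical and exist only to make the example checkable by hand.

```python
def backup_target(a, x, u, reward, x_next, c, gamma):
    """Right-hand side of the backup for one deterministic transition.

    Note: the target does not depend on u beyond the sampled reward and
    next state; u is kept in the signature to mirror A(x,u) in the text.
    """
    best_here = max(a[x].values())
    # A terminal state has no actions; its value is taken to be zero.
    best_next = max(a[x_next].values()) if a[x_next] else 0.0
    return best_here - c * (best_here - (reward + gamma * best_next))


# Hypothetical MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# TRANSITIONS[(x, u)] = (reward, next state); every step costs 1.
TRANSITIONS = {(0, 0): (-1.0, 1), (0, 1): (-1.0, 0),
               (1, 0): (-1.0, 2), (1, 1): (-1.0, 0)}

# Advantage table obtained from the optimal Q-values of this MDP with c = 2.
A = {0: {0: -2.0, 1: -4.0}, 1: {0: -1.0, 1: -5.0}, 2: {}}

for (x, u), (r, x_next) in TRANSITIONS.items():
    print((x, u), backup_target(A, x, u, r, x_next, c=2.0, gamma=1.0), A[x][u])
```

For this table the backup target equals the stored value for every state-action pair, which is what one expects of a solution to Equation 4.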

However, for reasons beyond the scope of this document, Equation 5 is not guaranteed to
converge when using a lookup table as the function approximator. Therefore, we define
Advantage Learning as a residual algorithm (Baird, 1995). The Bellman residual is the
difference between the two sides of Equation 4. The mean squared Bellman residual for an
MDP with n states is therefore defined to be:

$$MSBR = \frac{1}{n}\sum_{x}\left(\max_{u'} A(x,u') - c(x,u)\left[\max_{u'} A(x,u') - E\left\{R(x,u) + \gamma \max_{u'} A(x',u')\right\}\right] - A(x,u)\right)^{2}$$

Advantage  Learning  performs  gradient  descent  on  the  MSBR,  and  is  by  definition  a 
residual  algorithm.  The  vector  of  parameters,  W,  in  the  function  approximation  system  is 
updated  according  to  Equation  6  below. 


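$$\Delta W = -\alpha\left(\max_{u'} A(x,u') - c(x,u)\left[\max_{u'} A(x,u') - \left(R(x,u) + \gamma \max_{u'} A(x',u')\right)\right] - A(x,u)\right) \cdot \left(\phi\,\frac{\partial}{\partial W}\max_{u'} A(x,u') - \phi\,c(x,u)\left[\frac{\partial}{\partial W}\max_{u'} A(x,u') - \gamma\,\frac{\partial}{\partial W}\max_{u'} A(x',u')\right] - \frac{\partial}{\partial W} A(x,u)\right) \qquad (6)$$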


where α is the step size parameter and φ is a constant that controls a trade-off between
pure gradient descent (when φ equals 1) and a fast direct algorithm (when φ equals 0). For
a full discussion of residual algorithms see Baird (1995).
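The update can be sketched for a linear function approximator, for which the derivative of A(x,u) with respect to W is simply the feature vector of (x,u). The code below is an illustration under several assumptions, not the implementation used in the experiments: one-hot features (so the approximator behaves like a table), a single deterministic sample in place of the expectation, a non-terminal next state, and φ weighting the whole gradient of the target, as in the residual algorithms of Baird (1995). All names and numbers are hypothetical.

```python
def features(x, u, n_actions, size):
    """One-hot features, so the linear approximator acts like a lookup table."""
    f = [0.0] * size
    f[x * n_actions + u] = 1.0
    return f


def value(w, f):
    """Linear approximation A(x,u) = W . features(x,u)."""
    return sum(wi * fi for wi, fi in zip(w, f))


def greedy(w, x, n_actions):
    """Features of the action with the largest approximate Advantage in x."""
    return max((features(x, u, n_actions, len(w)) for u in range(n_actions)),
               key=lambda f: value(w, f))


def residual(w, sample, c, gamma, n_actions):
    """Sampled Bellman residual for sample = (x, u, reward, x_next)."""
    x, u, reward, x_next = sample
    best_here = value(w, greedy(w, x, n_actions))
    best_next = value(w, greedy(w, x_next, n_actions))
    target = best_here - c * (best_here - (reward + gamma * best_next))
    return target - value(w, features(x, u, n_actions, len(w)))


def residual_update(w, sample, c, gamma, alpha, phi, n_actions):
    """One parameter update: W <- W - alpha * residual * gradient term."""
    x, u, _, x_next = sample
    delta = residual(w, sample, c, gamma, n_actions)
    f_here = greedy(w, x, n_actions)
    f_next = greedy(w, x_next, n_actions)
    f_taken = features(x, u, n_actions, len(w))
    grad = [phi * (fh - c * (fh - gamma * fn)) - ft
            for fh, fn, ft in zip(f_here, f_next, f_taken)]
    return [wi - alpha * delta * g for wi, g in zip(w, grad)]


w0 = [0.0, 1.0, 0.5, -0.5]   # two states, two actions
sample = (0, 0, -1.0, 1)     # (x, u, reward, x_next)
w1 = residual_update(w0, sample, c=2.0, gamma=0.9, alpha=0.1, phi=1.0, n_actions=2)
print(residual(w0, sample, 2.0, 0.9, 2), residual(w1, sample, 2.0, 0.9, 2))
```

With φ = 1 a small step reduces the magnitude of the sampled residual. With φ = 0 only the weight of the visited state-action pair changes, which corresponds to the direct algorithm.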